Spatial clustering and denoising expressions¶
STAGATE is designed for spatial clustering and denoising expressions of spatial resolved transcriptomics (ST) data.
STAGATE learns low-dimensional latent embeddings with both spatial information and gene expressions via a graph attention auto-encoder. The method adopts an attention mechanism in the middle layer of the encoder and decoder, which adaptively learns the edge weights of spatial neighbor networks, and further uses them to update the spot representation by collectively aggregating information from its neighbors. The latent embeddings and the reconstructed expression profiles can be used to downstream tasks such as spatial domain identification, visualization, spatial trajectory inference, data denoising and 3D expression domain extraction.
Dong, Kangning, and Shihua Zhang. “Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder.” Nature Communications 13.1 (2022): 1-12.
import omicverse as ov
#print(f"omicverse version: {ov.__version__}")
import scanpy as sc
#print(f"scanpy version: {sc.__version__}")
ov.utils.ov_plot_set()
____ _ _ __ / __ \____ ___ (_)___| | / /__ _____________ / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ / /_/ / / / / / / / /__ | |/ / __/ / (__ ) __/ \____/_/ /_/ /_/_/\___/ |___/\___/_/ /____/\___/ Version: 1.5.3, Tutorials: https://omicverse.readthedocs.io/
Preprocess data¶
Here we present our re-analysis of 151676 sample of the dorsolateral prefrontal cortex (DLPFC) dataset. Maynard et al. has manually annotated DLPFC layers and white matter (WM) based on the morphological features and gene markers.
This tutorial demonstrates how to identify spatial domains on 10x Visium data using STAGATE. The processed data are available at https://github.com/LieberInstitute/spatialLIBD. We downloaded the manual annotation from the spatialLIBD package and provided at https://drive.google.com/drive/folders/10lhz5VY7YfvHrtV40MwaqLmWz56U9eBP?usp=sharing.
adata = sc.read_visium(path='data', count_file='151676_filtered_feature_bc_matrix.h5')
adata.var_names_make_unique()
reading data/151676_filtered_feature_bc_matrix.h5 (0:00:00)
sc.pp.calculate_qc_metrics(adata, inplace=True)
adata = adata[:,adata.var['total_counts']>100]
adata=ov.pp.preprocess(adata,mode='shiftlog|pearson',n_HVGs=3000,target_sum=1e4)
adata.raw = adata
adata = adata[:, adata.var.highly_variable_features]
adata
Begin robust gene identification
After filtration, 10747/10747 genes are kept. Among 10747 genes, 10747 genes are robust.
End of robust gene identification.
Begin size normalization: shiftlog and HVGs selection pearson
normalizing counts per cell The following highly-expressed genes are not considered during normalization factor computation:
[]
finished (0:00:00)
extracting highly variable genes
--> added
'highly_variable', boolean vector (adata.var)
'highly_variable_rank', float vector (adata.var)
'highly_variable_nbatches', int vector (adata.var)
'highly_variable_intersection', boolean vector (adata.var)
'means', float vector (adata.var)
'variances', float vector (adata.var)
'residual_variances', float vector (adata.var)
End of size normalization: shiftlog and HVGs selection pearson
View of AnnData object with n_obs × n_vars = 3460 × 3000
obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'spatial', 'log1p', 'hvg'
obsm: 'spatial'
layers: 'counts'
We read the ground truth area of our spatial data
# read the annotation
import pandas as pd
import os
Ann_df = pd.read_csv(os.path.join('data', '151676_truth.txt'), sep='\t', header=None, index_col=0)
Ann_df.columns = ['Ground Truth']
adata.obs['Ground Truth'] = Ann_df.loc[adata.obs_names, 'Ground Truth']
sc.pl.spatial(adata, img_key="hires", color=["Ground Truth"])
adata.obs['X'] = adata.obsm['spatial'][:,0]
adata.obs['Y'] = adata.obsm['spatial'][:,1]
Training STAGATE model¶
Here, we used ov.space.pySTAGATE to construct a STAGATE object to train the model.
Because we build the spatial network based on spatial location, our network can be directly divided into subgraphs in the following form.
STA_obj=ov.space.pySTAGATE(adata,num_batch_x=3,num_batch_y=2,
spatial_key=['X','Y'],rad_cutoff=200,num_epoch = 1000,lr=0.001,
weight_decay=1e-4,hidden_dims = [512, 30],
device='cuda:0')
------Calculating spatial graph... The graph contains 3060 edges, 559 cells. 5.4741 neighbors per cell on average. ------Calculating spatial graph... The graph contains 3328 edges, 595 cells. 5.5933 neighbors per cell on average. ------Calculating spatial graph... The graph contains 3448 edges, 613 cells. 5.6248 neighbors per cell on average. ------Calculating spatial graph... The graph contains 3044 edges, 541 cells. 5.6266 neighbors per cell on average. ------Calculating spatial graph... The graph contains 3128 edges, 559 cells. 5.5957 neighbors per cell on average. ------Calculating spatial graph... The graph contains 3320 edges, 595 cells. 5.5798 neighbors per cell on average. ------Calculating spatial graph... The graph contains 20052 edges, 3460 cells. 5.7954 neighbors per cell on average.
STA_obj.train()
100%|██████████| 1000/1000 [00:18<00:00, 53.53it/s]
We stored the latent embedding in adata.obsm['STAGATE'], and denoising expression in adata.layers['STAGATE_ReX']
STA_obj.predicted()
adata
AnnData object with n_obs × n_vars = 3460 × 3000
obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'Ground Truth', 'X', 'Y'
var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
uns: 'spatial', 'log1p', 'hvg', 'Ground Truth_colors', 'Spatial_Net'
obsm: 'spatial', 'STAGATE'
layers: 'counts', 'STAGATE_ReX'
Clustering the space¶
We can use GMM, leiden or louvain to cluster the space.
ov.utils.cluster(adata,use_rep='STAGATE',method='louvain',resolution=1)
ov.utils.cluster(adata,use_rep='STAGATE',method='leiden',resolution=1)
ov.utils.cluster(adata,use_rep='STAGATE',method='GMM',n_components=7,covariance_type='full',
tol=1e-9, max_iter=1000, random_state=3607)
running GaussianMixture clustering
finished: found 7 clusters and added
'gmm_cluster', the cluster labels (adata.obs, categorical)
Denoising¶
plot_gene = 'ATP2B4'
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[0], title='RAW_'+plot_gene, vmax='p99')
sc.pl.spatial(adata, img_key="hires", color=plot_gene, show=False, ax=axs[1], title='STAGATE_'+plot_gene, layer='STAGATE_ReX', vmax='p99')
